A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However,
performance on real-world websites still suffers from (1) open domainness, (2)
limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete the tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those snippets. We design WebAgent with Flan-U-PaLM,
for grounded code generation, and HTML-T5, a new pre-trained LLM for long HTML
documents that uses local and global attention mechanisms and a mixture of
long-span denoising objectives, for planning and summarization. We empirically
demonstrate that our recipe improves the success rate on a real website by
over 50%, and that HTML-T5 is the best model for solving HTML-based tasks,
achieving a 14.9% higher success rate than prior SoTA on the MiniWoB web
navigation benchmark and better accuracy on offline task planning evaluation.
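The plan/summarize/act loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the `plan`, `summarize`, and `generate_program` functions are placeholders standing in for calls to HTML-T5 (planning and summarization) and Flan-U-PaLM (code generation).

```python
# Hypothetical sketch of WebAgent's three-stage loop. The model calls are
# stubbed out; a real agent would query HTML-T5 and Flan-U-PaLM here.

def plan(instruction, history):
    """Decompose the instruction into the next canonical sub-instruction."""
    # Stub: HTML-T5 would produce the next sub-instruction from history.
    return f"sub-step {len(history) + 1} of: {instruction}"

def summarize(html, sub_instruction):
    """Extract task-relevant snippets from a long HTML document."""
    # Stub: HTML-T5 would rank and return relevant element snippets.
    return html[:200]

def generate_program(sub_instruction, snippet):
    """Synthesize an executable program for the sub-instruction."""
    # Stub: Flan-U-PaLM would emit real browser-automation code.
    return f"print('executing: {sub_instruction}')"

def web_agent(instruction, html, max_steps=3):
    history = []
    for _ in range(max_steps):
        sub = plan(instruction, history)
        snippet = summarize(html, sub)
        program = generate_program(sub, snippet)
        exec(program)  # act on the website via the generated program
        history.append(sub)
    return history

steps = web_agent("book a flight", "<html><body>...</body></html>")
```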
Understanding HTML with Large Language Models
Large language models (LLMs) have shown exceptional performance on a variety
of natural language tasks. Yet, their capabilities for HTML understanding --
i.e., parsing the raw HTML of a webpage, with applications to automation of
web-based tasks, crawling, and browser-assisted retrieval -- have not been
fully explored. We contribute HTML understanding models (fine-tuned LLMs) and
an in-depth analysis of their capabilities under three tasks: (i) Semantic
Classification of HTML elements, (ii) Description Generation for HTML inputs,
and (iii) Autonomous Web Navigation of HTML pages. While previous work has
developed dedicated architectures and training procedures for HTML
understanding, we show that LLMs pretrained on standard natural language
corpora transfer remarkably well to HTML understanding tasks. For instance,
fine-tuned LLMs are 12% more accurate at semantic classification compared to
models trained exclusively on the task dataset. Moreover, when fine-tuned on
data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks
using 192x less data compared to the previous best supervised model. Out of the
LLMs we evaluate, we show evidence that T5-based models are ideal due to their
bidirectional encoder-decoder architecture. To promote further research on LLMs
for HTML understanding, we create and open-source a large-scale HTML dataset
distilled and auto-labeled from CommonCrawl.
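The text-to-text framing that makes encoder-decoder models such as T5 a natural fit for semantic classification can be sketched as below. The prompt format and label set here are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch of framing semantic classification of HTML elements as a
# text-to-text task: the source is a prompt over an HTML snippet, the
# target is the element's semantic label. Format is hypothetical.

def make_example(html_snippet, salient_element_id, label):
    """Build a (source, target) pair for fine-tuning a seq2seq model."""
    source = f"classify element id={salient_element_id}: {html_snippet}"
    target = label  # e.g. "username", "password", "submit"
    return source, target

src, tgt = make_example(
    '<input id="3" type="text" name="user">', "3", "username")
```

A fine-tuned model then simply decodes the label string, which is what lets a standard pretrained LLM transfer to the task without a dedicated classification head.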
Small-scale proxies for large-scale Transformer training instabilities
Teams that have trained large Transformer-based models have reported training
instabilities at large scale that did not appear when training with the same
hyperparameters at smaller scales. Although the causes of such instabilities
are of scientific interest, the amount of resources required to reproduce them
has made investigation difficult. In this work, we seek ways to reproduce and
study training stability and instability at smaller scales. First, we focus on
two sources of training instability described in previous work: the growth of
logits in attention layers (Dehghani et al., 2023) and divergence of the output
logits from the log probabilities (Chowdhery et al., 2022). By measuring the
relationship between learning rate and loss across scales, we show that these
instabilities also appear in small models when training at high learning rates,
and that mitigations previously employed at large scales are equally effective
in this regime. This prompts us to investigate the extent to which other known
optimizer and model interventions influence the sensitivity of the final loss
to changes in the learning rate. To this end, we study methods such as warm-up,
weight decay, and the µParam (Yang et al., 2022), and combine techniques to
train small models that achieve similar losses across orders of magnitude of
learning rate variation. Finally, to conclude our exploration we study two
cases where instabilities can be predicted before they emerge by examining the
scaling behavior of model activation and gradient norms.
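The learning-rate sensitivity measurement described above can be illustrated on a toy problem. This is a hedged sketch using a simple quadratic objective, not the Transformer setup of the paper: sweeping the learning rate over orders of magnitude and recording the final loss shows both the stable regime and divergence at high learning rates.

```python
# Illustrative sketch (not the paper's setup): sweep learning rate over
# orders of magnitude on a toy quadratic and record the final loss.
# Divergence at high LR mimics the instabilities studied at scale.
import numpy as np

def final_loss(lr, steps=100):
    """Gradient descent on f(w) = 0.5 * w^2; diverges for large lr."""
    w = 1.0
    for _ in range(steps):
        w -= lr * w              # gradient of 0.5 * w^2 is w
        if not np.isfinite(w) or abs(w) > 1e6:
            return float("inf")  # training diverged
    return 0.5 * w ** 2

lrs = np.logspace(-3, 1, 9)      # learning rates from 1e-3 to 10
losses = [final_loss(lr) for lr in lrs]

# LR sensitivity: spread of final loss across the stable region
stable = [l for l in losses if np.isfinite(l)]
sensitivity = max(stable) - min(stable)
```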
Learning Natural Language Interfaces using Deep Neural Networks
Automating user tasks with natural language utterances, such as answering questions over Wikipedia or booking flight tickets on the Web, is a key component in designing intelligent systems. Natural language is usually preferred as a unified interface for these systems and requires no domain expertise from users; however, understanding a wide range of diverse inputs and resolving errors that occur during this process are still open challenges and the topics of this thesis.
Traditional machine learning systems for natural language interfaces usually require large-scale labeled datasets with handcrafted rules to train and evaluate the respective models. Firstly, the handcrafted design constrains the scope to a limited set of domains and prevents adaptation to new tasks. Additionally, large-scale labeled data collection is generally domain dependent, costly, and time consuming. These systems further assume that the underlying database, such as Freebase, is accessible and can be queried indefinitely, which is prohibitive when learning from constrained user interfaces, such as Web pages. Last but not least, current systems focus on training offline in a closed loop where users are excluded from the system's inference process; they lack the capability to continuously learn from users.
In this thesis, we address the drawbacks of the existing systems and propose data-efficient and user-centric solutions. We classify the natural language inference problem along two different perspectives: accessibility of the system functions (unconstrained or constrained user interfaces) and the nature of user involvement during inference (non-interactive or interactive user interfaces). We first develop neural-network-based systems for non-interactive and unconstrained user interfaces with different data types (i.e., structured and unstructured).
The system is trained to learn a continuous representation of a user utterance, and to generate and rank candidate answers from the underlying database using this representation. We augment these systems with an extractive candidate refinement framework by integrating task-oriented human-machine dialogues. Our system is able to understand, point to, and refine the errors in candidates by asking users validation questions and offering alternatives. We also address the limitations of unconstrained user interfaces and propose reinforcement learning methods to develop policies that are capable of learning from more constrained web interfaces. The policies are trained on a variety of web pages, such as flight booking and social media interaction, with task-based reward signals and no human supervision. We test the performance of our models with simulated as well as real users. Empirical results show that the proposed models are able to learn from limited supervised data and hold successful dialogues with users. We observe improvements in answer prediction accuracy, task success rate, and real user ratings.
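The generate-and-rank step described above can be sketched as follows. This is a hypothetical illustration: the learned neural encoder of the thesis is replaced here by a toy bag-of-words embedding, and the vocabulary and candidate strings are invented for the example.

```python
# Hypothetical sketch of ranking candidate answers by similarity to a
# continuous utterance representation. The encoder is a toy bag-of-words
# embedding standing in for the learned neural representation.
import numpy as np

VOCAB = ["book", "flight", "weather", "ticket", "city"]

def embed(text):
    """Toy bag-of-words embedding over a fixed vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def rank_candidates(utterance, candidates):
    """Score candidates by cosine similarity to the utterance embedding."""
    u = embed(utterance)
    def score(c):
        v = embed(c)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0
    return sorted(candidates, key=score, reverse=True)

ranked = rank_candidates(
    "book a flight ticket",
    ["weather in a city", "book flight ticket"])
```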
Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization
We tackle real-world problems with complex structures beyond pixel-based
games or simulators. We formulate this as a few-shot reinforcement learning
problem in which a task is characterized by a subtask graph that defines a set
of subtasks and their dependencies, which are unknown to the agent. Unlike
previous meta-RL methods that try to directly infer an unstructured task
embedding, our multi-task subtask graph inferencer (MTSGI) first infers the
common high-level task structure in terms of the subtask graph from the
training tasks, and uses it as a prior to improve task inference at test time.
Our experimental results on 2D grid-world and complex web navigation domains show
that the proposed method can learn and leverage the common underlying structure
of the tasks for faster adaptation to the unseen tasks than various existing
algorithms such as meta reinforcement learning, hierarchical reinforcement
learning, and other heuristic agents.
Comment: Accepted to UAI 2022 as an oral presentation.
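A subtask graph of the kind described above can be represented minimally as a map from each subtask to its precondition set. This is an illustrative sketch with invented subtask names; the actual MTSGI procedure infers such a structure from training tasks rather than having it given.

```python
# Illustrative subtask graph for a hypothetical web navigation task:
# each subtask maps to the set of subtasks that must be completed first.
SUBTASK_GRAPH = {
    "open_site": set(),          # no preconditions
    "login": {"open_site"},
    "search_item": {"login"},
    "checkout": {"search_item"},
}

def eligible(completed):
    """Subtasks whose preconditions are all satisfied and not yet done."""
    return {s for s, pre in SUBTASK_GRAPH.items()
            if pre <= completed and s not in completed}

nxt = eligible({"open_site", "login"})
```

Knowing the dependency structure prunes the agent's choices to the currently eligible subtasks, which is what lets a graph prior speed up adaptation on unseen tasks.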